Cancer drug response prediction with surrogate modeling-based graph neural architecture search

Abstract

Motivation: Understanding drug–response differences in cancer treatments is one of the most challenging aspects of personalized medicine. Recently, graph neural networks (GNNs) have become state-of-the-art methods in many graph representation learning scenarios in bioinformatics. However, building an optimal handcrafted GNN model for a particular drug sensitivity dataset requires manual design and fine-tuning of the model's hyperparameters, which is time-consuming and requires expert knowledge.

Results: In this work, we propose AutoCDRP, a novel framework for automated cancer drug–response prediction using GNNs. Our approach leverages surrogate modeling to efficiently search for the most effective GNN architecture. AutoCDRP uses a surrogate model to predict the performance of GNN architectures sampled from a search space, allowing it to select the optimal architecture based on the predicted evaluation performance. Hence, AutoCDRP can efficiently identify the optimal GNN architecture by estimating the performance of all architectures in the search space instead of training each of them. Through comprehensive experiments on two benchmark datasets, we demonstrate that the GNN architecture generated by AutoCDRP surpasses state-of-the-art designs. Notably, the optimal GNN architecture identified by AutoCDRP consistently outperforms the best baseline architecture from the first epoch, providing further evidence of its effectiveness.

Availability and implementation: https://github.com/BeObm/AutoCDRP.

2 Architecture Encoding Algorithm
Algorithm 2 presents the proposed method for encoding a GNN architecture sampled from the search space into a directed acyclic graph.
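As an illustration of this encoding step, the sketch below shows how a sampled architecture, represented as an ordered mapping from search-space components to chosen options, could be turned into a directed acyclic graph with networkx. The component names and the simple chain-style topology are illustrative assumptions; the exact encoding is the one defined by Algorithm 2.

```python
# Minimal sketch of encoding a sampled architecture as a directed acyclic graph.
# Component names, option values, and chain topology are illustrative assumptions;
# the actual encoding is defined by Algorithm 2 in the paper.
import networkx as nx

def encode_architecture(architecture: dict) -> nx.DiGraph:
    """Turn an ordered {component: option} mapping into a DAG whose nodes
    carry the chosen option and whose edges follow the component order."""
    dag = nx.DiGraph()
    for i, (name, option) in enumerate(architecture.items()):
        dag.add_node(i, component=name, option=option)
        if i > 0:
            dag.add_edge(i - 1, i)  # chain the components in sampling order
    return dag

sampled = {
    "convolution": "GCNConv",
    "aggregation": "mean",
    "hidden_dim": 128,
    "activation": "relu",
    "pooling": "GlobalMaxPooling",
}
dag = encode_architecture(sampled)
print(dag.nodes(data=True))
```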

Atom feature descriptions
The attributes of each atom node are represented as a 78-dimensional feature vector, as presented in Table 1.
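The sketch below illustrates how such a fixed-length atom feature vector can be assembled by concatenating one-hot encoded atom descriptors with RDKit. The particular descriptors and vocabulary sizes used here are assumptions for illustration only; the authoritative 78-dimensional feature set is the one listed in Table 1.

```python
# Illustrative sketch of building a fixed-length atom feature vector with RDKit.
# The chosen descriptors and vocabulary sizes are assumptions; the actual
# 78-dimensional feature set is the one listed in Table 1.
import numpy as np
from rdkit import Chem

def one_hot(value, choices):
    """One-hot encode `value`, mapping unseen values to the last slot."""
    if value not in choices:
        value = choices[-1]
    return [int(value == c) for c in choices]

def atom_features(atom):
    symbols = ["C", "N", "O", "S", "F", "Cl", "Br", "I", "P", "Unknown"]
    return np.array(
        one_hot(atom.GetSymbol(), symbols)
        + one_hot(atom.GetDegree(), list(range(11)))
        + one_hot(atom.GetTotalNumHs(), list(range(11)))
        + one_hot(atom.GetImplicitValence(), list(range(11)))
        + [int(atom.GetIsAromatic())]
    )

mol = Chem.MolFromSmiles("CCO")
features = np.stack([atom_features(a) for a in mol.GetAtoms()])
print(features.shape)  # (num_atoms, feature_dim)
```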

Aggregation and Convolution Functions
A message-passing scheme allows a node to acquire graph structure characteristics from its neighbor nodes through an aggregate function. The options for aggregation functions in the proposed search space include Add, Mean, and Max. As different neighbor node features have varying contributions to the representation of the central node in the graph, a convolution function is used to learn the different implications of each neighbor node in order to obtain a more accurate representation of the central node. The options for convolution functions in the proposed search space include GCNConv, GENConv, GINConv, SGConv, and Linear.
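To make the aggregation step concrete, the following minimal sketch shows how the chosen aggregation function controls the way neighbor messages are combined in PyTorch-Geometric's MessagePassing base class. The toy layer itself is purely illustrative and is not one of the actual search-space operators.

```python
# Minimal sketch: a message-passing layer whose neighbor aggregation
# ("add", "mean", or "max") is one of the search-space options above.
# The toy layer is illustrative, not an operator from the AutoCDRP framework.
import torch
from torch_geometric.nn import MessagePassing

class ToyConv(MessagePassing):
    def __init__(self, in_channels, out_channels, aggr="mean"):
        super().__init__(aggr=aggr)  # aggregation function from the search space
        self.lin = torch.nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        # Transform node features, then gather and aggregate neighbor messages.
        return self.propagate(edge_index, x=self.lin(x))

    def message(self, x_j):
        # x_j holds the (transformed) features of each neighbor node.
        return x_j

x = torch.randn(4, 78)                              # 4 atoms, 78-dim features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 2]])
out = ToyConv(78, 128, aggr="max")(x, edge_index)
print(out.shape)  # torch.Size([4, 128])
```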

Multi-head and Hidden Dimension
As computing multiple independent convolution functions for the convolution operator helps stabilize the learning process, we add a multi-head component to our search space. Its options include 1, 2, 4, 6, and 8. In addition, since reducing and transforming the dimension of the original features enhances the node's hidden representation, we also include a hidden dimension component in the search space. The hidden dimension options are 16, 64, 128, and 256.

Normalization Function and Activation Function
In addition, we add a normalization function over the node features to enhance the expressive ability of the model, and an activation function to give the model a nonlinear fitting capability, which plays an important role in smoothing the hidden representation. Normalization functions available in the search space include GraphNorm, BatchNorm, and None, while activation functions include sigmoid, relu, linear, PReLU, and softplus.

Loss Function and Optimization Function
Model training involves the use of loss and optimization functions. In order to optimize the search space of AutoCDRP to fit graphs with different distributions more effectively, we utilize various widely used loss and optimization functions. There are several options for loss functions, such as CrossEntropyLoss and NllLoss, as well as options for optimization functions, such as Adam and SGD.

Pooling Function
The global pooling function is used to aggregate the node representations over the entire graph in order to obtain the graph-level representation from the node-level representations. As part of the proposed search space, graph pooling options include GlobalAddPooling and GlobalMaxPooling.
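A minimal usage sketch, assuming these options map onto PyTorch-Geometric's global_add_pool and global_max_pool operators:

```python
# Minimal sketch: obtaining graph-level representations from node-level
# representations, assuming GlobalAddPooling / GlobalMaxPooling correspond
# to PyTorch-Geometric's global pooling operators.
import torch
from torch_geometric.nn import global_add_pool, global_max_pool

x = torch.randn(7, 128)                       # 7 nodes with 128-dim representations
batch = torch.tensor([0, 0, 0, 1, 1, 1, 1])   # node-to-graph assignment (2 graphs)
graph_add = global_add_pool(x, batch)         # shape: [2, 128]
graph_max = global_max_pool(x, batch)         # shape: [2, 128]
```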

Dropout and other hyper-parameters
The dropout operation helps prevent over-fitting during training. In the proposed search space, the dropout rate options include 0, 0.2, 0.4, and 0.6. Similarly, we provide options for the learning rate and L2 regularization in the search space. The options for the learning rate include 0.01, 0.001, 0.0001, and 0.0005, while the options for L2 regularization include 0.001, 0.0001, and 0.0005.
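Putting the preceding components together, the sketch below collects the options listed in this and the previous subsections into a single dictionary and draws one random candidate architecture from it. The dictionary keys are illustrative names and do not necessarily match those used in the released code.

```python
# Sketch: the search-space options from the preceding subsections gathered
# into one dictionary, plus uniform random sampling of a candidate architecture.
# Key names are illustrative and may differ from those in the released code.
import random

SEARCH_SPACE = {
    "aggregation":   ["add", "mean", "max"],
    "convolution":   ["GCNConv", "GENConv", "GINConv", "SGConv", "Linear"],
    "heads":         [1, 2, 4, 6, 8],
    "hidden_dim":    [16, 64, 128, 256],
    "normalization": ["GraphNorm", "BatchNorm", "None"],
    "activation":    ["sigmoid", "relu", "linear", "PReLU", "softplus"],
    "loss":          ["CrossEntropyLoss", "NllLoss"],
    "optimizer":     ["Adam", "SGD"],
    "pooling":       ["GlobalAddPooling", "GlobalMaxPooling"],
    "dropout":       [0, 0.2, 0.4, 0.6],
    "learning_rate": [0.01, 0.001, 0.0001, 0.0005],
    "l2":            [0.001, 0.0001, 0.0005],
}

def sample_architecture(space=SEARCH_SPACE):
    """Draw one architecture by picking an option uniformly for each component."""
    return {component: random.choice(options) for component, options in space.items()}

print(sample_architecture())
```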

Experiment Setup
Baseline Methods. In this study, we compare the GNN architectures output by AutoCDRP with tCNNS (Liu et al., 2019) and GraphDRP (Nguyen et al., 2021). tCNNS is a convolutional neural network-based method in which drugs are represented as SMILES strings. In tCNNS, one convolution layer extracts drug features from the SMILES format and another convolution layer extracts cancer cell-line features from genetic attribute vectors; finally, fully connected layers are added to predict the drug–cell-line response. GraphDRP is a graph neural network-based model in which drugs are represented as graphs and cell lines as one-hot vectors. In GraphDRP, drug and cell-line feature representations are learned by graph convolutional layers and 1D convolutional layers, respectively. The concatenation of the drug and cell-line representations is then used to predict the IC50 value. GraphDRP uses four graph neural network variants for learning drug features: GIN, GAT, GCN, and GAT combined with GCN. We denote the model obtained with each of them by GraphDRP GIN, GraphDRP GAT, GraphDRP GCN, and GraphDRP GAT-GCN, respectively. Implementation Settings. To train the surrogate model, we sample 800 architectures and select the k = 100 best-predicted graph neural architectures for validation. Throughout all steps, the number of epochs was set to 20, except when evaluating the best architecture, where 300 epochs are used instead. We predict the performance of $10^6$ architectures.
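As a high-level illustration of these settings, the sketch below paraphrases the search flow. Every helper passed in (sample_architecture, train_and_evaluate, train_surrogate, predict_rmse) is a hypothetical placeholder rather than part of the released AutoCDRP code, and the surrogate model itself is configured as described in the next paragraph.

```python
# High-level sketch of the surrogate-driven search described above. All helpers
# passed as arguments are hypothetical placeholders, not the released AutoCDRP code.

def surrogate_search(search_space, sample_architecture,
                     train_and_evaluate, train_surrogate, predict_rmse):
    # 1. Sample 800 architectures and obtain their RMSE with short (20-epoch) training.
    sampled = [sample_architecture(search_space) for _ in range(800)]
    labels = [train_and_evaluate(arch, epochs=20) for arch in sampled]

    # 2. Fit the surrogate model on (architecture, RMSE) pairs.
    surrogate = train_surrogate(sampled, labels)

    # 3. Score a large pool (10**6) of candidate architectures with the surrogate.
    pool = [sample_architecture(search_space) for _ in range(10**6)]
    predicted = [predict_rmse(surrogate, arch) for arch in pool]

    # 4. Keep the k = 100 best-predicted architectures and validate them (20 epochs).
    top_k = [arch for _, arch in
             sorted(zip(predicted, pool), key=lambda pair: pair[0])[:100]]
    validated = [(train_and_evaluate(arch, epochs=20), arch) for arch in top_k]

    # 5. Retrain the single best architecture for 300 epochs.
    _, best_arch = min(validated, key=lambda pair: pair[0])
    return best_arch, train_and_evaluate(best_arch, epochs=300)
```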
A two-layer graph isomorphism network (GIN) is used as the surrogate model. A random 80%/20% split of the dataset is used. The surrogate model takes architecture graphs as input and predicts the RMSE of each architecture as a scalar value. When the element-wise error is below β = 1, a squared term is used in the loss; otherwise, an absolute term is used (i.e., a smooth L1/Huber-style loss). The Adam optimizer is used to update the surrogate model weights. The learning rate and L2 regularization values are 0.00001 and 0.0005, respectively. In the experiments, the surrogate model is trained for 500 epochs. The experiments are implemented using the PyTorch-Geometric library (Fey and Lenssen, 2019).

Evaluation Metrics. In order to evaluate both the architectural models and the surrogate model, Pearson correlation coefficients (PCCs) and root mean square errors (RMSEs) are used. As a bivariate statistic, the Pearson correlation measures the strength of the linear relationship between two variables. A value of -1 indicates a perfect negative linear correlation, 0 indicates no linear correlation, and +1 indicates a perfect positive linear correlation. The Pearson correlation can be expressed as follows:

$$\mathrm{PCC} = \frac{\sum_{i=1}^{n}(o_i - \bar{o})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(o_i - \bar{o})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},$$

where O and Y represent the sets of true ln(IC50) and predicted ln(IC50) values, respectively, $n$ is the number of data points, $o_i$ and $y_i$ are the true and predicted ln(IC50) of the $i$-th data point, and $\bar{o}$ and $\bar{y}$ are the means of O and Y.
The root mean square error (RMSE) is a standard measure of the error made when predicting quantitative data with a model. Given the number of data points $n$, the true ln(IC50) $o_i$, and the predicted ln(IC50) $y_i$ of each sample, the RMSE is defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(o_i - y_i)^2}.$$

In addition, we use the Kendall tau correlation to evaluate the degree of concordance between the predicted performance evaluation and the ground-truth performance evaluation. Kendall tau ranges from -1 to +1: perfect concordance between the two rankings yields a coefficient of +1, and the absence of any association yields 0. Moreover, Kendall's tau has a low overall error sensitivity and asymptotic variance. It can be defined as follows:

$$\tau = \frac{2d}{n(n-1)},$$

where $n$ is the number of data points and $d$ is the difference between the number of concordant (ordered in the same way) and discordant (ordered differently) pairs.
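For reference, these three metrics can be computed directly with NumPy and SciPy; the sketch below assumes two small arrays of true and predicted ln(IC50) values purely for illustration.

```python
# Sketch: computing the three evaluation metrics with NumPy/SciPy,
# given arrays of true and predicted ln(IC50) values (illustrative data).
import numpy as np
from scipy import stats

o = np.array([2.1, -0.5, 1.3, 0.7])    # true ln(IC50)
y = np.array([1.9, -0.2, 1.1, 0.9])    # predicted ln(IC50)

pcc, _ = stats.pearsonr(o, y)           # Pearson correlation coefficient
rmse = np.sqrt(np.mean((o - y) ** 2))   # root mean square error
tau, _ = stats.kendalltau(o, y)         # Kendall tau rank correlation

print(f"PCC={pcc:.3f}  RMSE={rmse:.3f}  Kendall tau={tau:.3f}")
```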

Architectures designed by AutoCDRP
The best architectures found by AutoCDRP are reported in Table 2.

Statistical test
We apply nonparametric tests that are known to be suitable for comparing predictive models across multiple data samples (Demsar, 2006; García et al., 2010). Compared to parametric tests, nonparametric tests require fewer assumptions (Demsar, 2006). In this analysis, we use the results on both GDSC datasets reported in Table 3. We use the Friedman ranking test (García et al., 2010) to assess the statistical significance of differences in the models' performance across the datasets, based on the Pearson correlation coefficient (PCC). The overall performance ranking of the compared methods with respect to the PCC metric is presented in Figure 1. It can be seen that the proposed AutoCDRP ranks first.
The Friedman statistic is computed as

$$\chi^2_F = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right],$$

where $N$ is the number of datasets, $k$ is the number of methods, and $R_j$ is the average rank of the $j$-th method. With N = 6 and k = 8, we obtain $\chi^2_F = 23.7351$. Next, we calculate $F_F$ as shown in Eq. (2):

$$F_F = \frac{(N-1)\,\chi^2_F}{N(k-1) - \chi^2_F}. \tag{2}$$

In this experiment, with eight algorithms and six datasets, $F_F = 6.497$. $F_F$ follows the F-distribution with $8 - 1 = 7$ and $(8 - 1)\times(6 - 1) = 35$ degrees of freedom. According to the F-distribution table, the critical value (CV) of $F_F$ is 1.89 for $\alpha = 0.1$. As a result, the hypothesis of equal performance among the compared methods is clearly rejected, since $F_F > \mathrm{CV}$.
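As a quick check on the reported statistics, the short sketch below reproduces $F_F = 6.497$ from the reported $\chi^2_F = 23.7351$ with N = 6 and k = 8.

```python
# Quick check: recomputing the F statistic from the reported Friedman chi-square.
N, k = 6, 8
chi2_f = 23.7351
f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)
print(round(f_f, 3))  # ~6.497
```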
Consequently, we use the Bonferroni correction (Holm, 1979) as a post-hoc test to compute family-wise p-values. In this computation, we set AutoCDRP as the control method and compute the family-wise p-values. The family-wise p-values between AutoCDRP and the baseline methods are 7.5224e-06, 0.0014, 0.0021, 0.0046, 0.0133, 0.0391, and 0.0874 for tCNNS, GraOmicDRP, ARMA, ChebNet, GCN, GraTransDRP, and GraphDRP, respectively. The family-wise p-values reveal that AutoCDRP performs significantly better than the baseline methods at α = 0.1.
Finally, we perform the Nemenyi test (Demsar, 2006) to examine how the baseline methods perform against each other. Figure 2 shows that every baseline method performs significantly better than at most one other method, which confirms the superiority of AutoCDRP among the compared methods.

Figure 2: Nemenyi test scores. A low score means a significant difference between the performance of methods.